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Abstract — This paper addresses the estimation of the latent dimensionality in nonnegative matrix factorization (NMF) with the /3- 
divergence. The /3-divergence is a family of cost functions that includes the squared Euclidean distance, Kullback-Leibler and Itakura- 
Saito divergences as special cases. Learning the model order is important as it is necessary to strike the right balance between 
data fidelity and overfitting. We propose a Bayesian model based on automatic relevance determination in which the columns of the 
dictionary matrix and the rows of the activation matrix are tied together through a common scale parameter in their prior. A family of 
majorization-minimization algorithms is proposed for maximum a posteriori (MAP) estimation. A subset of scale parameters is driven 
to a small lower bound in the course of inference, with the effect of pruning the corresponding spurious components. We demonstrate 
the efficacy and robustness of our algorithms by performing extensive experiments on synthetic data, the swimmer dataset, a music 
decomposition example and a stock price prediction task. 

Index Terms — Nonnegative matrix factorization. Model order selection, Majorization-minimization, Group-sparsity. 
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1 Introduction 

Given a data matrix V of dimensions F x N with non- 
negative entries, nonnegative matrix factorization (NMF) 
consists in finding a low-rank factorization 



V « V = WH 



(1) 



where W and H are nonnegative matrices of dimensions 
F X K and K x N , respectively. The common dimension 
K is usually chosen such that F K + K N < FN, 
hence the overall number of parameters to describe the 
data (i.e., data dimension) is reduced. Early references 
on NMF include the work of Paatero and Tapper [1] 
and a seminal contribution by Lee and Seung [2J. Since 
then, NMF has become a widely-used technique for 
non-subtractive, parts-based representation of nonneg- 
ative data. There are numerous applications of NMF 
in diverse fields, such as audio signal processing |3l, 
image classification fl], analysis of financial data \5\, and 
bioinformatics |6|. The factorization is usually sought 
after through the minimization problem 

mimmize D(V|WH) subject to W > 0, H > (2) 

where A > means that all entries of the matrix A 
are nonnegative (and not positive semidefiniteness). The 
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function _D(V|WH) is a separable measure of fit, i.e.. 



N 



(3) 



/=ln=l 



where d{x\y) is a scalar cost function of y G M+ given 
X g M+; and it equals zero when a; = y. In this paper, we 
will consider the d{x\y) to be the /3-divergence, a family 
of cost functions parametrized by a single scalar /? G K. 
The squared Euclidean (EUC) distance, the generalized 
Kullback-Leibler (KL) divergence and the Itakura-Saito 
(IS) divergence are special cases of the /3-divergence. 
NMF with the /3-divergence (or, in short, /3-NMF) was 
first considered by Cichocki et al. in [7J, and more 
detailed treatments have been proposed in l8l- |[T0l . 

1.1 Main Contributions 

In most applications, it is crucial that the "right" model 
order K is selected. If K is too small, the data does not 
fit the model well. Conversely, if K too large, overfitting 
occurs. We seek to find an elegant solution for this 
dichotomy between data fidelity and overfitting. Tradi- 
tional model selection techniques such as the Bayesian 
information criterion (BIC) |11| are not applicable in our 
setting as the number of parameters is FK + KN and 
this scales linearly with the number of data points N, 
whereas BIC assumes that the number of parameters 
stays constant as the number of data points increases. 

To ameliorate this problem, we propose a Bayesian 
model for /3-NMF based on automatic relevance deter- 
mination (ARD) fT2|/ ^rid in particular, we are inspired 
by Bayesian PCA (Principal Component Analysis) fl3l . 
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We derive computationally ejficient algorithms with mono- 
tonidty guarantees to estimate the model order K and 
to estimate the basis W and the activation coefficients 

H. The proposed algorithms are based on the use of 
auxiliary functions (local majorizations of the objective 
function). The optimization of these auxiliary fimctions 
leads directly to majorization-minimization (MM) algo- 
rithms, resulting in efficient multiplicative updates. The 
monotonicity of the objective function can be proven by 
leveraging on techniques in [91 • We show via simulations 
in Section |6] on synthetic data and real datasets (such 
as a music decomposition example) that the proposed 
algorithms recover the correct model order and produce 
better decompositions. We also describe a procedure 
based on the method of moments for adaptive and data- 
dependent selection of some of the h5^erparameters. 

I . 2 Prior Work 

To the best of our knowledge, there is fairly limited liter- 
ature on model order selection in NMF. References |14J 
and [15] describe Markov chain Monte Carlo (MCMC) 
strategies for evaluation of the model evidence in EUC- 
NMF or KL-NMF. The evidence is calculated for each 
candidate value of K, and the model with highest evi- 
dence is selected. References fT6l , llTll describe reversible 
jump MCMC approaches that allow to sample the 
model order K, along with any other parameter. These 
sampling-based methods are computationally intensive. 
Another class of methods, given in references [18J-[21J, 
is closer to the principles that underlie this work; in 
these works the number of components K is set to a 
large value and irrelevant components in W and H are 
driven to zero during inference. A detailed but qualita- 
tive comparison between our work and these methods is 
given in Section|5l In Section[6l we compare the empirical 
performance of our methods to [18J and [21 J. 

This paper is a significant extension of the authors' 
conference publication in [22]. Firstly, the cost function 
in |22 ]] was restricted to be the KL-divergence. In this 
paper, we consider a continuum of costs parameterized 
by /3, underlying different statistical noise models. We 
show that this flexibility in the cost fimction allows for 
better quality of factorization and model selection on 
various classes of real-world signals such as audio and 
images. Secondly, the algorithms described herein are 
such that the cost function monotonically decreases to a 
local minimum whereas the algorithm in [22] is heuristic. 
Convergence is guaranteed by the MM framework. 

1.3 Paper organization 

In Section 121 we state our notation and introduce ^-NMF 
and the MM technique. In Section |3l we present our 
Bayesian model for /iJ-NMF. Section |4] details ti- and ^2- 
ARD for model selection in /3-NMF. We then compare the 
proposed algorithms to other related works in Section [S] 
In Section [6l we present extensive numerical results to 
demonstrate the efficacy and robustness of £1- and £2- 
ARD. We conclude the discussion in Section [71 



Auxiliary function G(H H) 
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TABLE 1 

The form of the auxiliary function for various /3's (9]. 



2 Preliiviinaries 

2.1 Notations 

We denote by V, W and H, the data, dictionary and 
activation matrices, respectively. These nonnegative ma- 
trices are of dimensions F x N, F x K and K x N, 
respectively. The entries of these matrices are denoted 
by Vfn, Wfk and hkn respectively. The fc*'^ column of W 
is denoted by w^, G M^, and hf. e denotes the k^^ row 
of H. Thus, W = [wi, . . . , wk] and H = [h[ , . . .,hj.f. 

2.2 NIVIF with the /d-divergence 

This paper considers NMF based on the /^-divergence, 
which we now review. The /?-divergence was originally 
introduced for /3 > 1 in [23], and later generalized 
to /3 e M in [TJ, which is the definition we use here: 



Ml 
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/3 G M\{0,1} 
/3 = 1 
/3 = 



(4) 



The limiting cases /3 = and /3 = 1 correspond to 
the IS and KL-divergences, respectively. Another case 
of note is /3 = 2 which corresponds to the squared 
Euclidean distance, i.e., di3=2{x\y) — {x — y)^/2. The 
parameter (3 essentially controls the assumed statistics of 
the observation noise and can either be fixed or learnt 
from training data by cross-validation. Under certain 
assumptions, the /3-divergence can be mapped to a log- 
likelihood function for the Tweedie distribution [25|, 
parametrized with respect to its mean. In particular, the 
values /? = 0,1,2 underlie the multiplicative Gamma 
observation noise, Poisson noise and Gaussian additive 
observation noise respectively. We describe this property 
in greater detail in Section 13.21 The /3-divergence offers 
a continuum of noise statistics that interpolates between 
these three specific cases. In the following, we use the 
notation _D^(V|WH) to denote the separable cost func- 
tion in ^ with the scalar cost d — dp in @. 

2.3 IVIajorization-minimization (IVIIVI) for /^-NIUIF 

We briefly recall some results in ^ on standard /3-NMF. 
In particular, we describe how an MM algorithm [26) 
that recovers a stationary point of l|3} can be derived. 
The algorithm updates H given W, and W given H, and 
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these two steps are essentially the same by the symmetry 
of W and H by transposition (V « WH is equivalent 
to ?a H^W^). Let us thus focus on the optimization 
of H given W. The MM framework involves building 
a (nonnegative) auxiliary function G(H|H) that majorizes 
the objective C(H) = £)^(V|WH) everywhere, i.e., 

G(H|H) > C(H), (5) 

for all pairs of nonnegative matrices H, H £ M^^^. 
The auxiliary function also matches the cost function 
whenever its arguments are the same, i.e., for all H, 

G(H|H) = C{H.). (6) 

If such an auxiliary function exists and the optimization 
of G(H|H) over H for fixed H is simple, the optimization 
of G(H) may be replaced by the simpler optimization 
of G(H|H) over H. Indeed, any iterate H^^+i) such that 
G(H(*+i)|H(')) < G(H(*)|H(*)) reduces the cost since 

C(H(»+i)) < G(H(*+i)|H«) < G(H(')|h(*)) = G(H«). 

(7) 

The first inequality follows from (|5]l and the second from 
the optimality of H'^'+^'. The MM update thus consists 
in 

H(*+i) = argmin G(H|HW). (8) 

H>0 

Note that if H(*+i) = H('), a local minimum is attained 
since the inequalities in ^ are equalities. The key of 
the MM approach is thus to build an auxiliary function 
G which reasonably approximates the original objective 
at the current iterate H, and such that the function is 
easy to minimize (over the first variable H). In our 
setting, the objective function G(H) can be decomposed 
into the sum of a convex term and a concave term. As 
such, the construction proposed in |9| and [8J consists 
in majorizing the convex and concave terms separately, 
using Jensen's inequality and a first-order Taylor approx- 
imation, respectively. Denoting vfn — [WH]/„ and 

A -0-2 A ~/3-l ,a\ 

/ / 

the resulting auxiliary function can be expressed as in 
Table [H where est denote constant terms that do not 
depend on H. In the sequel, the use of the tilde over 
a parameter will generally denote its previous iterate. 
Minimization of G(H|H) with respect to (w.r.t) H thus 
leads to the following simple update 

hkn = hkn — I (10) 
\qknj 

where the exponent j{f3) is defined as 

r 1/(2-/3), /3<1 
7(/3) = < 1, l</3<2 (11) 

[ l/(/3-l), /3>2 



3 The Model for Automatic relevance 
determination in /3-nmf 

In this section, we describe our probabilistic model for 
NME The model involves tying the fc"' column of W 
to the fc*'' row of H together through a common scale 
parameter Afe. If At is driven to zero (or, as we will see, 
a positive lower bound) during inference, then all entries 
in the corresponding column of W and row of H will 
also be driven to zero. 

3.1 Priors 

We are inspired by Bayesian PCA ||T31 where each ele- 
ment of W is assigned a Gaussian prior with column- 
dependent variance-like parameters Afc. These A^'s are 
known as the relevance weights. However, our formula- 
tion has two main differences vis-a-vis Bayesian PCA. 
Firstly, there are no nonnegativity constraints in Bayesian 
PCA. Secondly, in Bayesian PCA, thanks to the simplicity 
of the statistical model (multivariate Gaussian observa- 
tions with Gaussian parameter priors), H can be easily 
integrated out of the likelihood, and the optimization can 
be done over p(W, A|V), where A = (Ai, . . . , A^) e 
is the vector of relevance weights. We have to maintain 
the nonnegativity of the elements in W and H and 
also in our setting the activation matrix H cannot be 
integrated out analytically. 

To ameliorate the above-mentioned problems, we pro- 
pose to tie the columns of W and the rows of H together 
through common scale parameters. This construction is 
not over-constraining the scales of W and H, because 
of the inherent scale indeterminacy between Wfe and hf.. 
Moreover, we choose nonnegative priors for W and H 
to ensure that all elements of the basis and activation 
matrices are nonnegative. We adopt a Bayesian approach 
and assign W and H Half-Normal or Exponential priors. 
When W and H have Half-Normal priors, 

P{wfk\\k) = 'HN{Wfk\\k), P{hkn\><k) = HAfiwknlXk), 

(12) 

where for x > 0, HM{x\X) = (d)^^^ cxp(-f^), and 
T-LN{x\\) = when a; < 0. Note that if a; is a Gaussian 
(Normal) random variable, then |a;| is a Half -Normal. 
When W and H are assigned Exponential priors, 

p(w/fc|Afe) = f (w/fc|Afc), p{hkn\'^k) = £{wkn\'^k), (13) 

where for x > 0, £{x\\) ^ iexp(— |-), and £{x\\) ~ 
otherwise. Note from (|T2l l and ((T3l l that the fc*'' column 
of W and the fc"^ row of H are tied together by a common 
variance-like parameter Afc, also known as the relevance 
weight. When a particular Afc is small, that particular 
column of W and row of H are not relevant and vice 
versa. When a row and a column are not relevant, their 
norms are close to zero and thus can be removed from 
the factorization without compromising too much on 
data fidelity. This removal of common rows and columns 
makes the model more parsimonious. 
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Finally, we impose inverse-Gamma priors on each 
relevance weight A^, i.e.. 



p{Xk;a,b) ^IQ{\k\a,b) 



b" 



-a; 



-(q+1) 



cxp 



b 



(14) 



where a and b are the (nonnegative) shape and scale 
hyperparameters respectively. We set a and b to be 
constant for all k. We will state how to choose these in 
a principled manner in Section 14.51 Furthermore, each 
relevance parameter is independent of every other, i.e., 
p{X;a,b) = Ylk=iPi^k;a,b). 

3.2 Likelihood 

The /3-divergence is related to the family of Tweedie 
distributions [25J . The relation was noted by Cichocki 
et al. |27| and detailed in |28| . The Tweedie distribution 
is a special case of the exponential dispersion model 
129], itself a generalization of the more familiar natural 
exponential family. It is characterized by the simple 
polynomial relation between its mean and variance 

^&r[x\^(t)^l^-'^, (15) 

where = E[x] is the mean, (3 is the shape parameter , and 
4> is referred to as the dispersion parameter. The Tweedie 
distribution is only defined for j3 <1 and /3 > 2. For (3 ^ 
0, 1, its probability density function (pdf) or probability 
mass function (pmf) can be written in the following form 



T(x|^, (j), (3) — h{x, (j)) exp 



1 



1 



V/3-1 



f}-i 



(16) 



where (j)) is referred to as the base function. For 
(3 € {0, 1}, the pdf or pmf takes the appropriate limiting 
form of l fT6l l. The support of T{x\fi, 4>, (3) varies with the 
value of /3, but the set of values that /i can take on is 
generally K+, except for j3 = 2, for which it is R and 
the Tweedie distribution coincides with the Gaussian 
distribution of mean /i and variance (f>. For (3 = 1 (and 
4> — 1), the Tweedie distribution coincides with the 
Poisson distribution. For /3 = 0, it coincides with the 
Gamma distribution with shape parameter a = and 
scale parameter The base function admits a closed 

form only for l3 £ {-1, 0, 1, 2}. 

Finally, the deviance of Tweedie distribution, i.e., the 
log-likelihood ratio of the saturated (/i = x) and general 
model is proportional to the /3-divergence. In particular. 



log 



T{x\n = x,(l),l3) 



1 



df^ixln), 



(17) 



Tix\p.,(j),p) 

where ( • | ■ ) is the scalar cost function defined in 
(S). As such the /3-divergence acts as a minus log- 
likelihood for the Tweedie distribution, whenever the 
latter is defined. Because the data coefficients {w/n} are 
conditionally independent given (W,H), the negative 
log-likelihood function is 

-logp(V|W,H) = ii^pfVIWH) +cst. (18) 

1. We employ the following convention for the Gamma distribution 

g{x; a, b) = x"-ie-^/V{b"r(a)). 



3.3 Objective function 

We now form the maximum a posteriori (MAP) objective 
function for the model described in Sections 13.11 and 13.21 
On account of lO, lO and lO, 



C(W, H, A) = - logp(W, H, A|V) 
= -D^{Y\WU) (/(w,) + fik 



k=l 



b) 



+ clog Afc + est, 



(19) 



(20) 



where l|20l l follows from Bayes' rule and for the two 
statistical models, 

. Half -Normal model as in ([12), /(x) = i||x||| and 

c^{F + N)/2 + a + l. 
• Exponential model as in ((T3] |. /(x) = ||x||i and c — 

F + N + a+l. 

Observe that for the regularized cost function in | |20|| , 
the second term is monotonically decreasing in A^ while 
the third term is monotonically increasing in Afc. Thus, 
a subset of the A^'s will be forced to a lower bound 
which we specify in Section l434l while the others will tend 
to a larger value. This serves the purpose of pruning 
irrelevant components out of the model. In fact, the 
vector of relevance parameters A = (Ai, . . . , Ajc) can 
be optimized analytically in j^Ul l leading to an objective 
function that is a function of W and H only, i.e., 

C(W,H) = 

1 ^ 

-Dp{Y\WB.) + c^log if{wk) + f{h) + b) + est, (21) 

^ k=l 

where est = Kc{l — logc). 

In our algorithms, instead of optimizing ^T\\ , we keep 
Afc as an auxiliary variable for optimizing C(W,H,A) 
in (|20t to ensure that the columns H and the rows of W 
are decoupled. More precisely, and h/^ are condition- 
ally independent given Afc. In fact, ||2l]| shows that the 
A-optimized objective function C(W, H) induces sparse 
regularization among groups, where the groups are pairs 
of columns and rows, i.e., {wfc,/ij.}. In this sense, our 
work is related to group LASSO [30J and its variants. 
See for example |31|. The function x i— > log(x -I- 6) in 
is a sparsity-inducing term and is related to reweighted 
^1 -minimization [32|. We discuss these connections in 
greater detail in the supplementary material 



4 Inference Algorithms 

In this section, we describe two algorithms for optimiz- 
ing the objective function (|20l l for H given fixed W. The 
updates for W are symmetric given H. These algorithms 
will be based on the MM idea for /3-NMF and on the 
two prior distributions of W and H. In particular, we 
use the auxiliary function G(H|H) defined in Table [1] as 
an upper bound of the data fit term L)^(V|WH). 
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4.1 Algorithm for fa-ARD /3-NMF 

We now introduce ^2-ARD /3-NME In this algorithm, we 
assume that W and H have Half -Normal priors as in (|T2] | 
and thus, the regularizer is 



(22) 



The main idea behind the algorithms is as follows: Con- 
sider the function F(H|H) = G(H|H)+i?2(H) which 
is the original auxiliary function G(H|H) times (j>^^ plus 
the £2 regularization term. It can in fact be easily shown 
in f9'. Sec. 6] that F(H|H) is an auxiliary function to the 
(penalized) objective fimction in l|20l l. Ideally, we would 
take the derivative of i^(H|H) w.r.t hkn and set it to zero. 
Then the updates would proceed in a manner analogous 
to iflOl l. However, the regularization term i?2(H) does 
not "fit well" with the form of the auxiliary function 
G(H|H) in the sense that VHi^(H|H) = cannot be 
solved analytically for all /3 G K. Thus, our idea for £2- 
ARD is to consider the cases /3 > 2 and /3 < 2 separately 
and to find an upper bound of F(H|H) by some other 
auxiliary function J(H|H) so that the resulting equation 
Vh'/(H|H) = can be solved in closed-form. 

To derive our algorithms, we first note the following: 

Lemma 1. For every ly > 0, the function gy{t) = i(i/* — 1) 
is monotonically non-decreasing in < e M. In fact, gv{t) is 
monotonically increasing unless v = 1. 

In the above lemma, (0) = log v by L'Hopital's rule. 
The proof of this simple result can be found in [34]. 

We first derive ^2-ARD for /3 > 2. The idea is to upper 
bound the regularizer i?2(H) in l|22jl elementwise using 
Lemma [H and is equivalent to the moving-term technique 
described by Yang and Oja in [34J, [35J. Indeed, we have 



hkn 
hkn 



- 1 



1 

< - 



hkn 
hkn 



(23) 



by taking v ~ hkn /hkn in Lemma [T] Thus, for /3 > 2, 



2A, 



< 



1 r 



hkr, 
hkr, 



+ est 



(24) 



where est is a constant w.r.t the optimization variable 
hkn- We upper bound the regularizer (|22) l elementwise 
by l|24)l . The resulting auxiliary function (modified ver- 
sion of F(H|H)) is 



J(H|H) = -G{U\U 



\^ _}_12 f ^kn 



(25) 



Note that l|24l l holds with equality iff = 1 or equiva- 
lently, hkn — hkn so (|6} holds. Thus, J(H|H) is indeed 
an auxiliary function to F(H|H). Recalling the definition 
of G(H|H) for /3 > 2 in Table [11 differentiating J(H|H) 
w.r.t hkn and setting the result to zero yields the update 



Pkr 



Qkn + {<j)/Xk)hkn 



(26) 



Algorithm 1 ^2-ARD for ^-NMF 



Input: Data matrix V, hyperparameter a, tolerance t 
Output: Normegative matrices W and H, nonnegative 
relevance vector A and model order K^s 
Init: Fix K. Initialize W e R^''^ and H G R+''^ to 
normegative values and tolerance parameter tol = cxo 
Calculate: c = (F + N)/2 + a + 1 and ^(/3) as in ^ 
Calculate: Hyperparameter 6 as in li38t 
while (tol < t) do 



H ^ H 



Wr[(WH)-<'3-i)]+0H/rcpinat(A,l,Af)^ 



[(WH)-' 



[(WH)('3-i)]H^+0V\^/repmat(A.F.l)^ 

tol ^ maxfe=i,...^if |(Afc - Afe)/Afe| 
end while 
Calculate: Kcs as in 



Note that the exponent l/(/3 — 1) corresponds to j{l3) 
for the /3 > 2 case. Also observe that the update is 
similar to MM for /?-NMF [cf. except that there is 
an additional term in the denominator. 

For the case /? < 2, our strategy is not to majorize the 
regularization term. Rather we majorize the the auxiliary 
function G(H|H) itself. By applying Lemma [T] with h' = 
hkn/ hkn, we have that for all /3 < 2, 



^kn 
hkn 



- 1 



1 

< - 

- 2 



which means that 



^kn 



^kri 
^kn 

^kn 
^kn 



est. 



(27) 



(28) 



By replacing the first term of G(H|H) in Table [J (for /3 < 
2) with the upper bound above, we have the following 
new objective function 



J(H|H) = J2 



Iknhk 



2(j) 
Pkn^kn 



hkn 
hkn 



hkn 
hkn 



/3-1 



"fcTl 



^) (29) 



Differentiating J(H|H) w.r.t hkn and setting to zero 
yields the simple update 



hkn 



Pkr, 



qkn + {4>/>'k)hkn 



1/(3-/9) 



(30) 



To summarize the algorithm concisely, we define the 
exponent used in the updates in (|26l l and (|30l l as 



m = 



1/(3-/3) /3<2 
l/(/3-l) /3>2 



(31) 



Finally, we remark that even though the updates in 
and li30t are easy to implement, we either majorized 
the regularizer i?2(H) or the auxiliary function G(H|H). 
These bounds may be loose and thus may lead to slow 
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Algorithm 2 ^i-ARD for /J-NMF 



Input: Data matrix V, hyperparameter a, tolerance r 
Output: Nonnegative matrices W and H, nonnegative 
relevance vector A and model order K^s 
Init: Fix K. Initialize W e R+'''^'' and H G M+''^ to 
nonnegative values and tolerance parameter tol = oo 
Calculate: c = F + N + a+l and ^{P) as in ^ 
Calculate: Hyperparameter h as in | |38|| 
while (tol < t) do 



H ^ H 



■7(/3) 



W^[(WH)-(«-i)]+0/rcpmat(A,l,A') 
VV ^ VV \^[cvvrH) (/3-i)]Hr+0/repmat(A,F,l) 

{Yjf Wfk + En ^kn + h) / c for all k 
tol ^ niaxfc=i^...,i4: |(Afe - \k)l^k\ 
end while 

Calculate: iCcff as in (|34l 



convergence in the resulting algorithm. In fact, we can 
show that for j3 = 0,1,2, we do not have to resort 
to upper bounding the original function F(H|H) = 
(p^^ G(H|H) + i?2(H). Instead, we can choose to solve a 
polynomial equation to update hkn- The cases ^ = 0, 1, 2 
correspond respectively to solving cubic, quadratic and 
linear equations in hkn respectively. It is also true that for 
all rational /3, we can form a polynomial equation in hkn 
but the order of the resulting polynomial depends on the 
exact value of (3. See the supplementary material 1,33 J . 

4.2 Algorithm for ii-ARD /3-NMF 

The derivation of ^i-ARD /3-NMF is similar to its £2 
counterpart. We find majorizers for either the likelihood 
or the regularizes We omit the derivations and refer the 
reader to the supplementary material l33l . In sum. 



hhri 



hh-n 



Pkr. 



, Qkn + 0/ Afc 

where 7(/3) is defined in |(TTJ. 



(32) 



4.3 Update of 

We have described how to update H using either ^i-ARD 
or ^2-ARD. Since H and W are related in a symmetric 
manner, we have also effectively described how to up- 
date W. We now describe a simple update rule for the 
Afc's. This update is the same for both £1- and £2-ARD. 
We first find the partial derivative of C(W, H, A) w.r.t 
Afc and set it to zero. This gives the update: 



A, 



fM + fihk) + b 



(33) 



where /( • ) and c are defined after | |20t . 



4.4 Stopping criterion and determination of K^s 

In this section, we describe the stopping criterion and 
the determination of the effective number of components 



Kas- Let A — (Ai, . . . , Aa-) and A = (Ai, . . . , A^-) be the 
vector of relevance weights at the current (updated) and 
previous iterations respectively. The algorithm is termi- 
nated whenever tol ~ max^^i ^ |(Afc — Afc)/Afc| falls 
below some threshold r > 0. Note from | [33l l that iterates 
of each A^ are bounded from below as Xk > B = b/c 
and this bound is attained when Wfe and hk are zero 
vectors, i.e., the fc'^ column of W and the fc*^ row of H 
are pruned out of the model. After convergence, we set 
Kcs to be the number of components of such that the 
ratio {Xk ~ B)/B is strictly larger than t, i.e.. 



K, 



off 



ke{l,...,K} : 



Xk-B 



B 



> T 



(34) 



where t > is some threshold. We choose this threshold 
to be the same as that for the tolerance level tol. 

The algorithms ^2-ARD and ^i-ARD are detailed in 
Algorithms [T] and |2] respectively. In the algorithms, we 
use the notation A • B to mean entrjrwise multiplication 
of A and B; ^ to mean entrywise division; and A ''^ to 
mean entrywise raising to the 7"^ power. In addition, 
repmat(A, 1, A^) denotes the K x N matrix with each 
column being the A vector. 

4.5 Choosing the hyperparameters 

4.5. 1 Choice of dispersion parameter 

The dispersion parameter cj) represents the tradeoff be- 
tween the data fidelity and the regularization terms 
in (|20] |. It needs to be fixed, based on prior knowl- 
edge about the noise distribution, or learned from the 
data using either cross-validation or MAP estimation. 
In the latter case, cj) is assigned a prior p{(j>) and the 
objective C(W, H, A, (p) can be optimized over <j). This 
is a standard feature in penalized likelihood approaches 
and has been widely discussed in the literature. In this 
work, we will not address the estimation of 0, but 
only study the influence the regularization term on the 
factorization given cj). In many cases, it is reasonable to 
fix (j) based on prior knowledge. In particular, imder the 
Gaussian noise assumption, ~ A/'(w/„|?)/„, cr^), and 
(3 — 2 and (j) = a^. Under the Foisson noise assumption, 
Vfn ^ 'P{'^fn\vfn), and 13 = 1 and = 1. Under 
multiplicative Gamma noise assumption, vjn — Vfn- (-fn 
and €fn is a Gamma noise of mean 1, or equivalently 
Vfn ^ Q{vkn\a,Vfn/oi), and 13 = Q and = l/a. 
In audio applications where the power spectrogram is 
to be factorized, as in Section I6.3[ the multiplicative 
exponential noise model (with a = 1) is a generally 
agreed upon assumption [3 J and thus = 1. 

4.5.2 Clioice of tiyperparameters a and b 

We now discuss how to choose the hyperparameters a 
and b in (141 in a data-dependent and principled way. 
Our method is related to the method of moments. We first 
focus on the selection of b using the sample mean of 
data, given a. Then the selection of a based on the sample 
variance of the data is discussed at the end of the section. 
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Consider the approximation in which can be writ- 
ten element-wise as 

Vfn « Vfn = '^Wfkhkn- (35) 

k 

The statistical models corresponding to shape parameter 
/? ^ (1,2) imply that E[w/„|w/„] = Vfn- We extrapolate 
this property to derive a rule for selecting the hyperpa- 
rameter 6 for all /3 € K (and for nonnegative real-valued 
data in general) even though there is no known statistical 
model governing the noise when /3 S (1,2). When FN is 
large, the law of large numbers implies that the sample 
mean of the elements in V is close to the population 
mean (with high probability), i.e., 

Av = YN'^'^f^ ~ IE[u/„] = E[vfn] = '^E[wfkhkn] (36) 

fn k 

We can compute for the Half -Normal and Expo- 

nential models using the moments of these distributions 
and those of the inverse-Gamma for Xk- These yield 

f Half-Normal 

By equating these expressions to the empirical mean /tv/ 
we conclude that we can choose b according to 

^ J <a-^)^^^ ^2-ARD 

^ (a-l)(a-2)Av ^^.ARD " ^^^^ 

In summary, b oc fiw / K and b oc (Jl^ j K)^!"^ for I2- and 
^i-ARD respectively. 

By using the empirical variance of V and the relation 
between the mean and variance of the Tweedie distribu- 
tion in (Us), we may also estimate a from the data. The 
resulting relations are more involved and these calcu- 
lations are deferred to the supplementary material [33] 
for /3 G {0,1,2}. However, experiments showed that 
the resulting learning rules for a did not consistently 
give satisfactory results, especially when FN is not 
sufficiently large. In particular, the estimates sometimes 
fall out of the parameter space, which is a known feature 
of the method of moments. Observe that a appears in the 
objective function (|2l]l only through c = (F + A^)/2 + a + l 
(^2-ARD) orc = i^ + 7V + a + l (^i-ARD). As such, the 
influence of a is moderated by + A^. Hence, if we want 
to choose a prior on a that is not too informative, then 
we should choose a to be small compared to F + N. 
Experiments in Section |6] confirm that smaller values of 
a (relative to F + N) typically produce better results. As 
discussed in the conclusion, a more robust estimation of 
a (as well as b and 0) would involve a fully Bayesian 
treatment of our problem, which is left for future work. 

5 Connections with other works 

Our work draws parallel with a few other works on 
model order selection in NMF. The closest work is flE\ 
which also proposes automatic component pruning via 



a MAP approach. It was developed during the same 
period as and independently of our earlier work [22J. 
An extension to multi-array analysis is also proposed 
in mi. In Us), Morup & Hansen consider NMF with the 
Euclidean and KL costs. They constrained the columns 
of W to have unit norm (i.e., ||wfe||2 = 1) and assumed 
that the coefficients of H are assigned exponential priors 
£{hkn\^k)- A non-informative Jeffrey's prior is further 
assumed on A^. Put together, they consider the following 
optimization over (W, H): 

minimize ^(VlWH) + V Jh + iV log 

VVf.H,A ^-^ Xk 

k 

subject to W > 0, H > 0, ||wfc||2 = 1, Vfc (39) 

where D{-\-) is either the squared Euclidean distance or 
the KL-divergence. A major difference compared to our 
objective function in | |20|| is that this method involves 
optimizing W under the constraint ||wfe||2 = 1, which 
is non- trivial. As such, to solve l|39] | the authors in [18l 
use a change of variable wj^ ^ w/i;/||wfe||2 and derive a 
heuristic multiplicative algorithm based on the ratio of 
negative and positive parts of the new objective function, 
along the lines of |36|. In contrast, our approach treats 
Wfc and ft-j, symmetrically and the updates are simple. 
Furthermore, the pruning approach in [18 1 only occurs 
in the rows H and the corresponding columns of W 
may take any nonnegative value (subject to the norm 
constraint), which makes the estimation of these columns 
of W ill-posed (i.e., the parametrization is such that a 
part of the model is not observable). In contrast, in our 
approach w^, and hf. are tied together so they converge 
to zero jointly when A^ reaches its lower bound. 

Our work is also related to the automatic rank de- 
termination method in Projective NMF proposed by 
Yang et al. in [20 1. Following the principle of PC A, 
Projective NMF seeks a nonnegative matrix W such that 
the projection of V on the subspace spanned by W best 
fits V. In other words, it is assumed that H = W-^V. 
Following ARD in Bayesian PCA as originally described 
by Bishop [13], Yang et al. consider the additive Gaussian 
noise model and propose to place half-normal priors 
with relevance parameters A^: on the columns of W. 
They describe how to adapt EM to achieve MAP esti- 
mation of W and its relevance parameters. 

Estimation of the model order in the Itakura-Saito 
NMF (multiplicative exponential noise) was addressed 
by Hoffman et al. [21 1. They employ a nonparametric 
Bayesian setting in which K is assigned a large value 
(in principle, infinite) but the model is such that only a 
finite subset of components is retained. In their model, 
the coefficients of W and H have Gamma priors with 
fixed hyperparameters and a weight parameter 9k is 
placed before each component in the factor model, i.e., 
^'/n = J2k^kWfkhkn- The Weight, akin to the relevance 
parameter in our setting, is assigned a Gamma prior with 
a sparsity-enforcing shape parameter. A difference with 
our model is the a priori independence of the factors and 
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the weights. Variational inference is used in ||2T]| . 

In contrast with the above-mentioned works, the work 
herein presents a unified framework for model selection 
in /3-NMF. The proposed algorithms have low complex- 
ity per iteration and are simple to implement, while 
decreasing the objective function at every iteration. We 
compare the performance of our algorithms to those 
in [18\ and [21] in Sections 16.31 (music decomposition) 
and 16.41 (stock price prediction). 

6 Experiments 

In this section, we present extensive numerical exper- 
iments demonstrating the robustness and efficiency of 
the proposed algorithms for (i) uncovering the correct 
model order and (ii) learning better decompositions for 
modeling nonnegative data. 

6.1 Simulations with synthetic data 

In this section, we describe experiments on synthetic 
data generated according to the model. In particular, we 
fixed a pair of hyperparameters (atrue , ^true) and sampled 
-f^tiuc — 5 relevance weights according to the inverse- 
Gamma prior in (|T4l l. Conditioned on these relevance 
weights, we sampled the elements of W and H from 
the Half-Normal or Exponential models depending on 
whether we chose to use £2- or £i-ARD. These models 
are defined in l fT2l l and ( [131 respectively. We set atmc = 50 
and fetruc = 70 for reasons that will be made clear in 
the following. We define the noiseless matrix V as WH. 
We then generated a noisy matrix V given V according 
to the three statistical models (3 — 0,1,2 corresponding 
to IS-, KL- and EUC-NMF respectively. More precisely, 
the parameters of the noise models are chosen so that 
the signal-to-noise ratio SNR in dB, defined as SNR = 
201ogio(||V||F/||V- VIIj.), is approximately 10 dB for 
each /? e {0, 1, 2}. For (3 — 0, this corresponds to an a, 
the shape parameter, of approximately 10. For (3 — 1, the 
parameterless Poisson noise model results in an integer- 
valued noisy matrix V. Since there is no noise parameter 
to select Poisson noise model, we chose fetruc so that the 
elements of the data matrix V are large enough resulting 
in an SNR « 10 dB. For the Gaussian observation model 
{(3 = 2), we can analytically solve for the noise variance 
so that the SNR is approximately 10 dB. In addition, 
we set the number of columns N — 100, the initial 
number of components K — 2 Ku uc — 10 and chose two 
different values for F, namely 50 and 500. The threshold 
value T is set to 10^^ (refer to Section 14.41 1. It was ob- 
served using this value of the threshold that the iterates 
of Afc converged to their limiting values. We ran £1- and 
^2-ARD for a e {5, 10, 25, 50, 100, 250, 500} and using b 
computed as in Section |4.5.2| The dispersion parameter (p 
is assumed known and set as in the discussion after iflSl l. 

To make fair comparisons, the data and the initializa- 
tions are the same for £2- and ^i-ARD as well as for every 
(/3,a). We averaged the inferred model order K^s over 
10 different runs. The results are displayed in Fig. [T] 



L1-ARD(F = 50) L2-ARD(F = 50) 




10' 10= 10' 10= 

hyperparameter a hyperparameter a 



Fig. 1. Estimated number of components as a function 
of tine Inyperparameter a (log-linear plot). The true model 
order is Ktmc = 5. The solid line is the mean across 10 
runs and the error bars display ± the standard deviation. 



Firstly, we observe that £i-ARD recovers the model 
order -ft^truc = 5 correctly when a < 100 and (3 G {0, 1, 2}. 
This range includes a^y^c = 50, which is the true hyper- 
parameter we generated the data from. Thus, if we use 
the correct range of values of a, and if the SNR is of the 
order 10 dB (which is reasonable in most applications), 
we are able to recover the true model order from the 
data. On the other hand, from the top right and bottom 
right plots, we see that £2-ARD is not as robust in 
recovering the right latent dimensionality. 

Secondly, note that the quality of estimation is rela- 
tively consistent across various (3's. The success of the 
proposed algorithms hinges more on the amoiint of noise 
added (i.e., the SNR) compared to which specific (3 is 
assumed. However, as discussed in Section |3l2l the shape 
parameter (3 should be chosen to reflect our belief in the 
underlying generative model and the noise statistics. 

Thirdly, observe that when more data are available 
{F — 500), the estimation quality improves significantly. 
This is evidenced by the fact that even ^2-ARD (bottom 
right plot) performs much better - it selects the right 
model order for all a < 25 and (3 e {1,2}. The estimates 
are also much more consistent across various initializa- 
tions. Indeed the standard deviations for most sets of 
experiments is zero, demonstrating that there is little or 
no variability due to random initializations. 

6.2 Simulations with the swimmer dataset 

In this section we report experiments on the swimmer 
dataset introduced in [37]. This is a synthetic dataset of 
N = 256 images each of size = 32 x 32 = 1024. Each 
image represents a swimmer composed of an invariant 
torso and four limbs, where each limb can take one 




Fig. 2. Sample images of tine noisy swimmer data. Tine 
colormap is adjusted sucin tinat black corresponds to the 
smallest data coefficient value 0) and white the 

largest (w/„ = 24). 



L1-ARD L2-ARD 




10' lo'^ lo'^ 10' lo'^ 10^ 

hyperparameter a hyperparameter a 



Fig. 3. Estimated number of components K^^ as a 
function of a for li- and ^2-ARD. The plain line is the 
average value of K^^ over the 10 runs and dashed-lines 
display ± the standard deviation. 



of four positions. We set background pixel values to 1 
and body pixel values to 10, and generated noisy data 
with Poisson noise. Sample images of the resulting noisy 
data are shown in Fig. |2l The "ground truth" number 
of components for this dataset is -fsTtmc = 16, which 
corresponds to all the different limb positions. The torso 
and background form an invariant component that can 
be associated with any of the four limbs, or equally 
split among limbs. The data images are vectorized and 
arranged in the columns of V. 

We applied ii- and ^2-ARD with (5 = 1 (KL- 
divergence, matching the Poisson noise assumption, and 
thus (j) = I), K ^ i2 = 2is:truc and r = 10"^. We 
tried several values for the hyperparameter a, namely 
a e {5,10,25,50,75,100,250,500,750,1000}, and set b 
according to (|38l l. For every value of a we ran the algo- 
rithms from 10 common positive random initializations. 
The regularization paths returned by the two algorithms 
are displayed in Fig. |3] £i-ARD consistently estimates 
the correct number of components (i^Ttruc = 16) up 
to a = 500. Fig. H] displays the learnt basis, objective 
function and relevance parameters along iterations in 
one run of £i-ARD when a = 100. It can be seen that 
the ground truth is perfectly recovered. 

In contrast to i'l-ARD, Fig. |3] shows that the value of 
Kgff returned by i?2-ARD is more variable across runs 
and values of a. Manual inspection reveals that some 
runs return the correct decomposition when a = 500 
(and those are the runs with lowest end value of the 
objective function, indicating the presence of apparent 
local minima), but far less consistently than £i-ARD. 
Then it might appear like the decomposition strongly 
overfits the noise for a G {750,1000}. However, visual 
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10° io' 10^ io' 



iterations 

Fig. 4. Top: Dictionary learnt in one run of ^i-ARD with 
a = 100. The dictionary elements are presented left to 
right, top to bottom, by descending order of their rele- 
vance Afc. For improved visualization and fair comparison 
of the relative importance of the dictionary elements, we 
display Wfc rescaled by the expectation of hkn, i e., for ii- 
ARD, AfcWfc. The figure colormap is then adjusted to fit the 
full range of values taken by Wdiag A. Middle: values of 
the objective function i|21]i along iterations (log-log scale). 
Bottom: values of Xk-B along iterations (log-linear scale). 



inspection of learnt dictionaries with these values show 
that the solutions still make sense. As such. Fig. |5] 
displays the dictionary learnt by £2-ARD with a = 1000. 
The figure shows that the hierarchy of the decomposition 
is preserved, despite that the last 16 components capture 
some residual noise, as a closer inspection would reveal. 
Thus, despite that pruning is not fully achieved in the 
16 extra components, the relevance parameters still give 
a valid interpretation of what are the most significant 
components. Fig. |5] shows the evolution of relevance 
parameters along iterations and it can be seen that the 
16 "spurious" components approach the lower bound 
in the early iterations before they start to fit noise. 
Note that ^2-ARD returns a solution where the torso is 
equally shared by the four limbs. This is because the 
£2 penalization favors this particular solution over the 
one returned by £i-ARD, which favors sparsity of the 
individual dictionary elements. 

With T = 10^®, the average number of iterations for 
convergence is approximately 4000 ±2000 for £i-ARD for 
all a. The average number of iterations for £2-ARD is of 
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Fig. 5. Top: Dictionary learnt by ^2-ARD witin a = 1000. 
Tine dictionary is displayed using the same convention as 
in Fig.m except that the vectors are now rescaled by 
the expectation of hun under the Half-Normal prior, i.e., 
(2Afc/7r)i/^. Middle: values of the cost function i|2l]i along 
iterations (log-log scale). Bottom: values of Afe - B along 
iterations (log-linear scale). 



the same order for a < 500, and increases to more than 
10, 000 iterations for a > 750, because all components are 
active for these a's. 



6.3 Music decomposition 

We now consider a music signal decomposition example 
and illustrate the benefits of ARD in NMF with the IS 
divergence (/3 = 0). Fevotte et al. [3 J showed that IS- 
NMF of the power spectrogram underlies a generative 
statistical model of superimposed Gaussian components, 
which is relevant to the representation of audio signals. 
As explained in Sections 13.21 and 14.51 this model is also 
equivalent to assuming that the power spectrogram is 
observed in multiplicative exponential noise, i.e., setting 
(j) = 1/a = 1. We investigate the decomposition of the 
short piano sequence used in fS), a monophonic 15 
seconds-long signal xt recorded in real conditions. The 
sequence is composed of 4 piano notes, played all at 
once in the first measure and then played by pairs in 
all possible combinations in the subsequent measures. 
The STFT Xfn of the temporal data xt was computed 
using a sinebell analysis window of length L = 1024 (46 
ms) with 50 % overlap between two adjacent frames. 
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Fig. 6. Three representations of data; (top): original 
score, (middle): time-domain recorded signal, (bottom): 
log-power spectrogram. 



leading to iV = 674 frames and F = 513 frequency 
bins. The musical score, temporal signal and log-power 
spectrogram are shown in Fig. |6l In [3] it was shown 
that IS-NMF of the power spectrogram w/„ = |x/„p can 
correctly separate the spectra of the different notes and 
other constituents of the signal (sound of hammer on the 
strings, sound of sustain pedal etc.). 

We set K = 18 (3 times the "ground truth" number 
of components) and ran ^2-ARD with /3 = 0, a = 5 
and b computed according to | |38|| . We ran the algorithm 
from 10 random initializations and selected the solution 
returned with the lowest final cost. For comparison, we 
ran standard nonpenalized Itakura-Saito NMF using the 
multiplicative algorithm described in |3|, equivalent to 
^2-ARD with Afc ^ oo and j{l3) = 1. We ran IS-NMF 
10 times with the same random initializations we used 
for ARD IS-NMF, and selected the solution with mini- 
mum fit. Additionally, we ran the methods by Morup 
& Hansen (with KL-divergence) [18J and Hoffman et al. 
[21 J. We used Matlab implementations either publicly 
available |,21J or provided to us by the authors [18J. The 
best among ten runs of these methods was selected. 

Given an approximate factorization WH of the data 
spectrogram V returned by any of the four algorithms 
we proceeded to reconstruct time-domain components 
by Wiener filtering, following [3|. The STFT estimate 
Ck,fn of component k is reconstructed by 

Wfkhkn 



Ckjn 



(40) 



and the STFT is inverted to produce the temporal com- 
ponent Cfe^tH By linearity of the reconstruction and inver- 
sion, the decomposition is conservative, i.e., xt=J2k ^fc,*- 

2. With the approach of Hoffman et al. \21\, the columns of W have 
to be multipHed by their corresponding weight parameter 9k prior to 
reconstruction. 
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(a) IS-NMF 

1 - STD = 4.72e-02 



2 - STD = 2.48e-02 



3 - STD = 2.06e-02 







4 - STD = 1 .61 e-02 



6 - STD = 8.53e-03 



(b) ARD IS-NMF 

1 - TOL = 1 .61e+04 - STD = 4.75e-02 



2 - TOL = 8.73e+03 - STD = 3.31e-02 
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3 - TOL = 3.91e+03 - STD = 2.06e-02 
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4 - TOL = 2.57e+03 - STD = 1 .39e-02 
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5 - TOL = 1 .01e+03 - STD = 8.37e-03 
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7 - STD = 2.74e-03 



8 - STD = 2.68e-03 



9 - STD = 2.23e-03 



7 - TOL = 2.46e-02 - STD = 2.00e-05 



8 - TOL = 2.35e-02 - STD = 2.1 2e-05 
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9 - TOL = 2.23e-02 - STD = 1 .85e-05 



Fig. 8. Histograms of standard deviation values of all 18 
components produced by IS-NMF, ARD IS-NMF Morup & 
Hansen [18J and Hoffman et al. [21]. ARD IS-NMF only 
retains 6 components, which correspond to the expected 
decomposition, displayed in Fig. |71 On this dataset, the 
methods proposed in [18] and [21J fail to produce the 
desired decomposition. 



1 - STD = 1 .83e-03 1 - TOL = 2.09e-02 - STD = 2.04e-05 



Fig. 7. The first ten components produced by IS-NMF 
and ARD IS-NMF STD denotes the standard deviation 
of the time samples. TOL is the relevance relative to the 
bound, i.e., (Afe - B)/B. With IS-NMF the second note of 
the piece is split into two components (k = 2 and k = 4). 



The components produced by IS-NMF were ordered 
by decreasing value of their standard deviations (computed 
from the time samples). The components produced by 
ARD IS-NMF, Morup & Hansen [18] and Hoffman et 
al. | |2T| were ordered by decreasing value of their relevance 
weights ({Afe} or {9k})- Fig.[7]displays the ten first compo- 
nents produced by IS-NMF and ARD IS-NMF. The ?/-axes 
of the two figures are identical so that the component 
amplitudes are directly comparable. Fig. [5] displays the 
histograms of the standard deviation values of all 18 
components for IS-NMF, ARD IS-NMF, Morup & Hansen 
IITSi and Hoffman et al. [21j|l 

The histogram on top right of Fig.|8]indicates that ARD 
IS-NMF retains 6 components. This is also confirmed 
by the value of relative relevance (Afe — B)/B (upon 
convergence of the relevance weights), displayed with 
the components on Fig. [7l which drops by a factor of 
about 2000 from component 6 to component 7. The 

3. The sound files produced by all the approaches ar e ava ilable at 
|http://perso.telecom-paristech.fr7~fevotte/Sampies/pamil27soundsa 



6 components correspond to expected semantic units 
of the musical sequence: the first four components ex- 
tract the individual notes and the next two components 
extract the sound of hammer hitting the strings and 
the sound produced by the sustain pedal when it is 
released. In contrast IS-NMF has a tendency to overfit, 
in particular the second note of the piece is split into 
two components (fc = 2 and fc = 4). The histogram 
on bottom left of Fig. |8] shows that the approach of 
Morup & Hansen [18] (with the KL-divergence) retains 
11 components. Visual inspection of the reconstructed 
components reveals inaccuracies in the decomposition 
and significant overfit (some notes are split in subcom- 
ponents). The poorness of the results is in part explained 
by the inadequacy of the KL-divergence (or Euclidean 
distance) for factorization of spectrograms, as discussed 
in |3[. In contrast our approach offers flexibility for ARD 
NMF where the fit-to-data term can be chosen according 
to the application by setting /3 to the desired value. 

The histogram on bottom right of Fig.[8]shows that the 
method by Hoffman et al. [21] retains approximately 5 
components. The decomposition resembles the expected 
decomposition more closely than flSlI , except that the 
hammer attacks are merged with one of the notes. 
However it is interesting to note that the distribution 
of standard deviations does not follow the order of 
relevance values. This is because the weight parameter 
Ok is independent of W and H in the prior. As such, the 
factors are allowed to take very small values while the 
weight values are not necessarily small. 

Finally, we remark that on this data ^i-ARD IS-NMF 
:jt»eriprmed similarly to ^2-ARD IS-NMF and in both cases 
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Fig. 9. Top left: The stock data; Top right: Effective model 
order Kcs as a function of a. Bottom: Normalized KL- 
divergence (NKLD) for KL-NMF (left), £i- and ^2-ARD KL- 
NMF (right). Note that the y-axes on both plots are the 
same. Morup & Hansen's method [18J yielded an NKLD 
of 0.37 ± 0.03 (averaged over 10 runs) which is inferior to 
£2-ARD in as seen on the bottom right. 



the retrieved decompositions were fairly robust to the 
choice of a. We experimented with the same values of 
a as in previous section and the decompositions and 
their hierarchies were always found correct. We point 
out that, as with IS-NMF, initialization is an issue, as 
other runs did not produced the desired decomposition 
into notes. However, in our experience the best out of 
10 runs always output the correct decomposition. 

6.4 Prediction of stock prices 

NMF (with the Euclidean and KL costs) has previously 
been applied on stock data |5| to learn "basis functions" 
and to cluster companies. In this section, we perform 
a prediction task on the stock prices of the Dow 30 
companies (comprising the Dow Jones Industrial Aver- 
age). These are major American companies from various 
sectors of the economy such as services (e.g., Walmart), 
consumer goods (e.g.. General Motors) and healthcare 
(e.g., Pfizer). The dataset consists of the stock prices of 
these F = 30 companies from 3rd Jan 2000 to 27th Jul 
2011, a total of TV = 2543 trading days@ The data are 
displayed in the top left plot of Fig. |9l 

In order to test the prediction capabilities of our algo- 
rithm, we organized the data into an F x matrix V and 



removed 50% of the entries at random. For the first set of 
experiments, we performed standard /?-NMF with /? = 1, 
for different values of K, using the observed entries 
only0 We report results for different non-integer values 
of /J? in the following. Having performed KL-NMF on the 
incomplete data, we then estimated the missing entries 
by multiplying the inferred basis W and the activation 
coefficients H to obtain the estimate V. The normalized 
KL-divergence (NKLD) between the true (missing) stock 
data and their estimates is then computed as 

NKLD^^ ^ dKL(«/„|t)/„), (41) 

where f c {1, . . . , F} x {1, . . . , N} is the set of missing 
entries and (iKL(-|-) is the KL-divergence (/? = 1). 
The smaller the NKLD, the better the prediction of the 
missing stock prices and hence the better the decom- 
position of V into W and H. We then did the same 
for 1 1- and ^2-ARD KL-NMF, for different values of 
a and using K = 25. For KL-NMF the criterion for 
termination is chosen so that it mimics that in Section l4!4l 
Namely, as is commonly done in the NMF literature, 
we ensured that the columns of W are normalized to 
unity. Then, we computed the NMF relevance weights 
^NMF A i||/ij.||2. We terminate the algorithm when- 
ever tol' 



^ maxfe |(A^^^ - A^^^ )/A^^^ I falls below 
T = 5 X 10^^. We averaged the results over 20 random 
initializations. The NKLDs and the the inferred model 
orders K^s are displayed in Fig. |9l 

In the top right plot of Fig. |9l we observe that there 
is a general increasing trend; as a increases, the inferred 
model order K^fi also increases. In addition, for the same 
value of a, ^i-ARD prunes more components than ^2- 
ARD due to its sparsifying effect. This was also observed 
for synthetic data and the swimmer dataset. However, 
even though ^2-ARD retains almost all the components, 
the basis and activation coefficients learned model the 
underlying data better. This is because £2 penalization 
methods result in coefficients that are more dense and 
are known to be better for prediction (rather than spar- 
sification) tasks. 

From the bottom left plot of Fig. |9l we observe that 
when K is too small, the model is not "rich" enough 
to model the data and hence the NKLD is large. Con- 
versely, when K is too large, the model overfits the 
data, resulting in a large NKLD. We also observe that 
^2-ARD performs spectacularly across a range of values 
of the hyperparameter a, uniformly better than standard 
KL-NMF. The NKLD for estimating the missing stock 
prices hovers around 0.2, whereas KL-NMF results in 
an NKLD of more than 0.23 for all K. This shows that 
£2-ARD produces a decomposition that is more relevant 
for modeling missing data. Thus, if one does not know 
the true model order a priori and chooses to use I2-AKD 
g^^lJ^jome hyperparameter a, the resulting NKLD would 



4. Stock prices of the Dow 30 compa- 
nies are provided at the following link: 

|http://www.optiontradingtips.com/resources/historical-data/dow-jones3 
The raw data consists of 4 stock prices per company per day. The 

mean of the 4 data points is taken to be the representative of the stock 5. Accounting for the missing data involves applying a binary mask 
price of that company for that day. to V and WH, where indicates missing entries 1381 . 
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Fig. 10. Effect of varying sinape /3 and dispersion <f) on 
prediction performance. Average results over 10 runs. 

be much better than doing KL-NMF even though many 
components will be retained. In contrast, ^i-ARD does 
not perform as spectacularly across all values of a but 
even when a small number of components is retained 
(at a = 500, K^s = 5, NKLD for £i-ARD « 0.23, NKLD 
for KL-NMF w 0.25), it performs significantly better 
than KL-NMF. It is plausible that the stock data fits the 
assumptions of the Half-Normal model better than the 
Exponential model and hence ^2-ARD performs better. 

For comparison, we also implemented a version of the 
method by Morup & Hansen |18| that handles missing 
data. The mean NKLD value returned over ten runs is 
0.37 ±0.03, and thus it is clearly inferior to the methods 
in this paper. The data does not fit the model well. 

Finally, in Fig.[lO]we demonstrate the effect of varying 
the shape parameter (3 and the dispersion parameter 
The distance between the predicted stock prices and the 
true ones is measured using the NKLD in gl]l and the 
NEUC (the Euclidean analogue of the NKLD). We also 
computed the NIS (the IS analogue of the NKLD), and 
noted that the results across all 3 performance metrics are 
similar so we omit the NIS. We used Z2-ARD, set a = 1000 
and calculated b using l|38] l. We also chose integer and 
non-integer values of /3 to demonstrate the flexibility of 
Z2-ARD. It is observed that (3 = 0.5, = 10 gives the best 
NKLD and NEUC and that 1 < /3 < 1.5 performs well 
across a wide range of values of (j). 

7 Conclusion 

In this paper, we proposed a novel statistical model for 
/?-NMF where the columns of W and rows H are tied to- 
gether through a common scale parameter in their prior, 
exploiting (and solving) the scale ambiguity between W 
and H. MAP estimation reduces to a penalized NMF 
problem with a group-sparsity inducing regularizing 
term. A set of MM algorithms accounting for all values 
of /3 and either £1- or ^2-norm group-regularization was 
presented. They ensure the monotonic decrease of the 
objective function at each iteration and result in multi- 
plicative update rules of linear complexity in F, K and 
N . The updates automatically preserve nonnegativity 



given positive initializations and are easily implemented. 
The efficiency of our approach was validated on sev- 
eral synthetic and real-world datasets, with competitive 
performance w.r.t. the state-of-the-art. At the same time, 
our proposed methods offer improved flexibility over 
existing approaches (our approach can deal with vari- 
ous types of observation noise and prior structure in a 
unified framework). Using the method of moments, an 
effective strategy for the selection of hyperparemeter b 
given a was proposed and, as a general rule of thumb, 
we recommend to set a a small value w.r.t. F + N. 

There are several avenues for further research: Here 
we derived a MAP approach that works efficiently, but 
more sophisticated inference techniques can be envis- 
aged, such as fully Bayesian inference in the model we 
proposed in Section |3l Following similar treatments in 
sparse regression |39J, |40] or with other forms of matrix 
factorization [41J, one could seek the maximization of 
logp(V|a, b, (p) using variational or Markov chain Monte- 
Carlo inference, and in particular handle hyperparame- 
ter estimation in a (more) principled way Other more 
direct extensions of this work concern the factorization 
of tensors and online-based methods akin to (42|, I43il . 
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